Following BGC detection and annotation, all BGCs predicted by antiSMASH and GECCO were combined with the experimentally validated BGCs from the Minimum Information about a Biosynthetic Gene cluster (MIBiG) database, version 3.1. This yielded a total of 9059 BGCs, which were clustered into 1640 Gene Cluster Families (GCFs).

To explore and visualise the relationships between the identified GCFs, dimensionality reduction was performed using the graph-based Uniform Manifold Approximation and Projection (UMAP) tool.

Clustree

The clustree plot shows the relationship between clusterings at different resolutions. A resolution of 0.5 was chosen, resulting in 9 clusters.

UMAPs

Seurat clusters

Some of the clusters that seemed to overlap in the 2D UMAP are more separated in the 3D version (e.g. cluster 3).

Class

Number of GCFs per predicted biosynthetic class (rows) and Seurat cluster (columns):

class         0    1    2    3    4    5    6    7    8
mixed       182   49   48   20   43    0    5   14    1
NRPS          8  174   59   14   11    3   10   11    3
PKS         217    3   41    4   29    1    6    4    3
RiPP          2    3   10  122   14   51    5    5    3
saccharide    0    0    0    0    0    0    1    0    0
terpene       0    9    1    0    5   12   21   13    1
unknown       9   58   66   29   43   56   44   32   62

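There is considerable overlap between the assigned Seurat clusters and the predicted biosynthetic class. For example, cluster 1 consists mainly of NRPS GCFs (174 of 296) and cluster 3 mainly of RiPP GCFs (122 of 189). This suggests that the biosynthetic class is a good indicator of cluster similarity.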
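The overlap can be quantified per cluster. The following is a minimal sketch that takes the counts from the Class table above and reports, for each Seurat cluster, the most frequent class and its share of the cluster. The function name `dominant_class_per_cluster` is only illustrative and not part of the original analysis.

```python
# Contingency table copied from the Class table above:
# rows are predicted biosynthetic classes, columns are Seurat clusters 0..8.
counts = {
    "mixed":      [182, 49, 48, 20, 43, 0, 5, 14, 1],
    "NRPS":       [8, 174, 59, 14, 11, 3, 10, 11, 3],
    "PKS":        [217, 3, 41, 4, 29, 1, 6, 4, 3],
    "RiPP":       [2, 3, 10, 122, 14, 51, 5, 5, 3],
    "saccharide": [0, 0, 0, 0, 0, 0, 1, 0, 0],
    "terpene":    [0, 9, 1, 0, 5, 12, 21, 13, 1],
    "unknown":    [9, 58, 66, 29, 43, 56, 44, 32, 62],
}


def dominant_class_per_cluster(table):
    """Return (cluster, top class, top count, cluster size, share) per cluster."""
    n_clusters = len(next(iter(table.values())))
    summary = []
    for k in range(n_clusters):
        column = {cls: row[k] for cls, row in table.items()}
        total = sum(column.values())
        # On ties, max() keeps the class that appears first in the table.
        top = max(column, key=column.get)
        summary.append((k, top, column[top], total, column[top] / total))
    return summary


summary = dominant_class_per_cluster(counts)
for k, top, n, total, share in summary:
    print(f"cluster {k}: {top} {n}/{total} ({share:.0%})")
```

Note that several clusters are dominated by the "unknown" class, so the share of the top class is informative mainly for clusters such as 0, 1 and 3.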

Cluster Length

Method

MIBiG = GCF contains at least one MIBiG BGC -> known function
mixed = GCF contains no MIBiG BGCs, but contains both GECCO and antiSMASH BGCs
GECCO/antiSMASH = GCF exclusively contains BGCs detected with that method
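The three definitions above amount to a small decision rule. The sketch below is one way to express it, assuming each GCF is described by the set of sources of its member BGCs. The function name `gcf_method` and the lower-case source labels are hypothetical and may differ from the original analysis.

```python
def gcf_method(sources):
    """Assign a GCF-level method label from the sources of its member BGCs.

    `sources` is an iterable of labels such as "MIBiG", "antiSMASH", "GECCO"
    (compared case-insensitively; these labels are an assumption).
    """
    seen = {s.lower() for s in sources}
    if not seen:
        raise ValueError("a GCF must contain at least one BGC")
    unexpected = seen - {"mibig", "antismash", "gecco"}
    if unexpected:
        raise ValueError(f"unexpected source labels: {sorted(unexpected)}")
    if "mibig" in seen:
        return "MIBiG"        # at least one experimentally validated BGC
    if seen == {"antismash", "gecco"}:
        return "mixed"        # no MIBiG BGC, but both prediction tools
    if seen == {"antismash"}:
        return "antiSMASH"    # only antiSMASH predictions
    return "GECCO"            # only GECCO predictions


print(gcf_method(["antiSMASH", "GECCO", "antiSMASH"]))
print(gcf_method(["MIBiG", "GECCO"]))
```

Checking for MIBiG first means that a GCF with any validated member is always labelled MIBiG, regardless of which prediction tools contributed the remaining BGCs.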

Almost all GCFs (1607 of 1640) contain at least one MIBiG BGC; only 33 do not.

GCF_method   Number of GCFs
antiSMASH                 8
Gecco                     7
MIBiG                  1607
mixed                    18

Next Steps: Analysing individual Clusters

Clusters containing both MIBiG and antiSMASH/GECCO BGCs

GCF          Number of BGCs   Genomes containing at least one BGC (%)
GCF0000070              143   15.50218
GCF0000073              149   18.34061
GCF0000075              110   21.17904
GCF0000082              268   37.77293
GCF0000092              264   28.38428
GCF0000100             1009   100.00000
GCF0000923              367   75.10917
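The percentage column can be derived from the BGC identifiers alone. The sketch below assumes that predicted BGC identifiers start with the genome assembly accession (`GCA_` followed by digits), as in the identifiers listed elsewhere in this report, and that MIBiG entries do not carry such a prefix. The function names and the `n_genomes` argument are hypothetical, and the total number of genomes is not stated in this section.

```python
def genome_accession(bgc_id):
    """Return the leading assembly accession, e.g. 'GCA_019048645'."""
    parts = bgc_id.split("_")
    return "_".join(parts[:2])


def genome_prevalence(bgc_ids, n_genomes):
    """Percentage of genomes that contribute at least one BGC to a GCF.

    Identifiers without a 'GCA_' prefix (assumed here to be MIBiG entries)
    are ignored, because they do not belong to an analysed genome.
    """
    genomes = {genome_accession(b) for b in bgc_ids if b.startswith("GCA_")}
    return 100.0 * len(genomes) / n_genomes


# Toy example: three predicted BGCs from two genomes in a set of four genomes.
toy_gcf = [
    "GCA_019048645_antiSMASH_CP077404.1.region001",
    "GCA_019048645_CP077404.1_cluster_1",
    "GCA_002155285_CP021318.1_cluster_6",
]
print(genome_prevalence(toy_gcf, n_genomes=4))
```

Counting distinct genomes rather than BGCs matters here, because a single genome can contribute both an antiSMASH and a GECCO prediction to the same GCF.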

Clusters with no MIBiG BGCs

Seurat cluster   mixed   NRPS   PKS   RiPP   unknown
0                    0      0     0      0         0
1                    0      1     0      0         0
2                    1      1     0      1         2
3                    2      1     0      6         0
4                    0      0     1      0         1
5                    0      0     0     12         2
6                    0      0     0      0         0
7                    0      0     1      0         0
8                    0      0     0      1         0

Almost half of the GCFs without MIBiG BGCs (14 of 33) fall into Seurat cluster 5, which contains 123 GCFs in total. These 14 GCFs are listed below.

GCF         cluster_length  class    GCF_method  GCF_rep                                            Number of BGCs
GCF0000086  10195           RiPP     mixed       GCA_002157665_antiSMASH_BDOS01000001.1.region009   2
GCF0000088  10204           RiPP     mixed       GCA_002213005_antiSMASH_NBFC01000006.1.region001   366
GCF0000089  5212            RiPP     antiSMASH   GCA_000522725_antiSMASH_ALYW01000080.1.region001   1
GCF0000093  6922            unknown  Gecco       GCA_020531065_JAJBGA010000047.1_cluster_1          1
GCF0000095  2294            RiPP     antiSMASH   GCA_020529845_antiSMASH_JAJBDA010000106.region001  2
GCF0000096  3151            unknown  mixed       GCA_015668935_JADOZJ010000006.1_cluster_3          8
GCF0000098  3959            RiPP     antiSMASH   GCA_000339575_antiSMASH_AHTD01000072.1.region001   4
GCF0000101  1397            RiPP     antiSMASH   GCA_000340135_antiSMASH_AHTC01000107.1.region001   1
GCF0000104  1823            RiPP     antiSMASH   GCA_020529725_antiSMASH_JAJBCW010000149.region001  1
GCF0000105  1221            RiPP     mixed       GCA_033485055_JAVKRV010000007.1_cluster_1          9
GCF0000107  10231           RiPP     mixed       GCA_020529495_antiSMASH_JAJBCM010000002.region001  848
GCF0000350  10228           RiPP     mixed       GCA_012641545_antiSMASH_RXYY01000001.1.region003   30
GCF0001639  2695            RiPP     Gecco       GCA_002155285_CP021318.1_cluster_6                 1
GCF0001640  2276            RiPP     antiSMASH   GCA_020530685_antiSMASH_JAJBED010000074.region001  1

Strain GCA_019048645

-> orderable type strain

row   cluster_id                                    gcf_id      class  method     GCF_method
2208  GCA_019048645_antiSMASH_CP077404.1.region001  GCF0000099  NRPS   antiSMASH  mixed
2209  GCA_019048645_antiSMASH_CP077404.1.region003  GCF0000091  NRPS   antiSMASH  mixed
2210  GCA_019048645_antiSMASH_CP077404.1.region004  GCF0000103  NRPS   antiSMASH  mixed
2211  GCA_019048645_antiSMASH_CP077404.1.region005  GCF0000107  NRPS   antiSMASH  mixed
2212  GCA_019048645_antiSMASH_CP077404.1.region007  GCF0000088  NRPS   antiSMASH  mixed
2213  GCA_019048645_CP077404.1_cluster_1            GCF0000099  NRPS   gecco      mixed
2214  GCA_019048645_CP077404.1_cluster_3            GCF0000091  NRPS   gecco      mixed
2215  GCA_019048645_CP077404.1_cluster_4            GCF0000103  NRPS   gecco      mixed
2216  GCA_019048645_CP077404.1_cluster_5            GCF0000107  NRPS   gecco      mixed
2217  GCA_019048645_CP077404.1_cluster_8            GCF0000088  NRPS   gecco      mixed

Number of BGCs in each novel GCF that includes at least one BGC from the genome GCA_019048645:

## [1] "GCF0000099 :  1079"
## [1] "GCF0000091 :  425"
## [1] "GCF0000103 :  832"
## [1] "GCF0000107 :  848"
## [1] "GCF0000088 :  366"
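The strain table and the GCF sizes above can be cross-checked with a short grouping step. The sketch below copies the rows for GCA_019048645 and the reported GCF sizes from this section and groups the strain's BGCs by GCF. The variable and function names are illustrative only and are not taken from the original analysis.

```python
from collections import defaultdict

# (cluster_id, gcf_id, method) rows copied from the strain table above.
strain_bgcs = [
    ("GCA_019048645_antiSMASH_CP077404.1.region001", "GCF0000099", "antiSMASH"),
    ("GCA_019048645_antiSMASH_CP077404.1.region003", "GCF0000091", "antiSMASH"),
    ("GCA_019048645_antiSMASH_CP077404.1.region004", "GCF0000103", "antiSMASH"),
    ("GCA_019048645_antiSMASH_CP077404.1.region005", "GCF0000107", "antiSMASH"),
    ("GCA_019048645_antiSMASH_CP077404.1.region007", "GCF0000088", "antiSMASH"),
    ("GCA_019048645_CP077404.1_cluster_1", "GCF0000099", "gecco"),
    ("GCA_019048645_CP077404.1_cluster_3", "GCF0000091", "gecco"),
    ("GCA_019048645_CP077404.1_cluster_4", "GCF0000103", "gecco"),
    ("GCA_019048645_CP077404.1_cluster_5", "GCF0000107", "gecco"),
    ("GCA_019048645_CP077404.1_cluster_8", "GCF0000088", "gecco"),
]

# Total number of BGCs per GCF, as reported in the output above.
gcf_sizes = {
    "GCF0000099": 1079,
    "GCF0000091": 425,
    "GCF0000103": 832,
    "GCF0000107": 848,
    "GCF0000088": 366,
}


def methods_per_gcf(rows):
    """Map each GCF to the set of detection methods seen for this strain."""
    grouped = defaultdict(set)
    for _cluster_id, gcf_id, method in rows:
        grouped[gcf_id].add(method)
    return dict(grouped)


by_gcf = methods_per_gcf(strain_bgcs)
for gcf_id, methods in by_gcf.items():
    print(gcf_id, sorted(methods), gcf_sizes[gcf_id])
```

For this strain, each of the five GCFs contains exactly one antiSMASH region and one GECCO cluster, which suggests that both tools detected the same five loci.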